Sublinear regret for learning POMDPs

Abstract

We study model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment, in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method-of-moments estimations for hidden Markov models, belief error control in POMDPs, and upper confidence bound methods for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed learning algorithm, where $T$ is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.
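
A brief check of why this rate counts as sublinear, writing $R_T$ for the cumulative regret over $T$ steps (the symbol $R_T$ and the constant $C$ are introduced here for illustration and do not appear in the abstract):

$$
R_T \le C\,T^{2/3}\sqrt{\log T}
\quad\Longrightarrow\quad
\frac{R_T}{T} \le C\,\frac{\sqrt{\log T}}{T^{1/3}} \xrightarrow[T\to\infty]{} 0 .
$$

Under this reading, the average per-step gap to the oracle vanishes as the horizon grows, which is what sublinear regret means.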

Similar resources

Safe Policy Search for Lifelong Reinforcement Learning with Sublinear Regret

Lifelong reinforcement learning provides a promising framework for developing versatile agents that can accumulate knowledge over a lifetime of experience and rapidly learn new tasks by building upon prior knowledge. However, current lifelong learning methods exhibit non-vanishing regret as the amount of experience increases, and include limitations that can lead to suboptimal or unsafe control...

Algorithms with Logarithmic or Sublinear Regret for Constrained Contextual Bandits

We study contextual bandits with budget and time constraints under discrete contexts, referred to as constrained contextual bandits. The budget and time constraints significantly increase the complexity of exploration-exploitation tradeoff because they introduce coupling among contexts. Such coupling effects make it difficult to obtain oracle solutions that assume known statistics of bandits. T...

Regret Bounds for Lifelong Learning

We consider the problem of transfer learning in an online setting. Different tasks are presented sequentially and processed by a within-task algorithm. We propose a lifelong learning strategy which refines the underlying data representation used by the within-task algorithm, thereby transferring information from one task to the next. We show that when the within-task algorithm comes with some reg...

Sublinear-Time Optimization for High-Dimensional Learning

Across domains, the scale of data and complexity of machine learning models have both been increasing greatly in recent years. For many such large scale models of interest, tractable estimation without access to extensive computational infrastructure is still an open problem. In this thesis, we approach the tractability of large-scale estimation problems through the lens of leveraging sparse st...

Scalable Planning and Learning for Multiagent POMDPs

Online, sample-based planning algorithms for POMDPs have shown great promise in scaling to problems with large state spaces, but they become intractable for large action and observation spaces. This is particularly problematic in multiagent POMDPs where the action and observation space grows exponentially with the number of agents. To combat this intractability, we propose a novel scalable appr...

Journal

Journal title: Production and Operations Management

Year: 2022

ISSN: 1059-1478, 1937-5956

DOI: https://doi.org/10.1111/poms.13778